Extensible Markup Language - SW (XML-SW)

Skunkworks 10 February 2002

This version:
http://www.textuality.com/xml/xmlSW.html

Abstract

The Extensible Markup Language (XML) provides a set of rules for defining markup languages intended for use in encoding data objects, and specifies behavior for certain software modules that access them.

Status of this Document

This document is a private skunkworks and has no official standing of any kind, not having been reviewed by any organization in any way.

This draft was assembled by Tim Bray from text edited by himself, John Cowan, Dave Hollander, Andrew Layman, Eve Maler, Jonathan Marsh, Jean Paoli, C. Michael Sperberg-McQueen, and Richard Tobin, the editors of XML's first and second editions, Namespaces in XML, XML Infoset, and XML Base. There should be no suggestion that anybody other than Tim Bray approves of the content or even the existence of the present document.

The copyright statement above applies to almost all the text assembled for this document, but should not be taken as an indication that the W3C approves of the contents or existence of this document.

This document specifies XML SW. The recipe for the construction of XML SW is as follows: XML 1.0 [XML 2e], minus DTDs (and therefore necessarily entities), plus XML Base [XML Base], plus the XML Information Set [XML Infoset], plus XML Namespaces [XMLNamespaces]. The intent is to avoid introducing any modification to the semantics of any of the ingredient specifications; thus all of the syntax and behavior described in this document should be equivalent to that specified in one W3C Recommendation or another.

Table of Contents

1 Introduction
2 Documents, Elements, and Attributes
    2.1 Start-Tags, End-Tags, and Empty-Element Tags
    2.2 XML Namespaces
        2.2.1 Declaring Namespaces
        2.2.2 Using Qualified Names
        2.2.3 Namespace Declaration Scope and Overriding
        2.2.4 Namespace Defaulting
        2.2.5 Uniqueness of Attributes
    2.3 Parent, Child, and Root Elements
    2.4 Reserved Attributes
        2.4.1 White Space Handling
        2.4.2 Language Identification
        2.4.3 Base URI Specification
3 Other Markup
    3.1 Prolog and Document Type Declaration
    3.2 Comments
    3.3 Processing Instructions
    3.4 CDATA Sections
4 Characters and Text
    4.1 Characters
    4.2 Character References
    4.3 Character Data and Markup
    4.4 Common Syntactic Constructs
    4.5 Character Encoding in XML Documents
5 The Information Set
    5.1 Base URI
    5.2 "Unknown" and "No Value"
    5.3 Synthetic Infosets
    5.4 End-of-Line Handling
    5.5 Information Items
        5.5.1 The Document Information Item
        5.5.2 Element Information Items
        5.5.3 Attribute Information Items
        5.5.4 Processing Instruction Information Items
        5.5.5 Character Information Item
        5.5.6 Comment Information Items
        5.5.7 The Document Type Declaration Information Item
        5.5.8 Namespace Information Items
6 Conformance
    6.1 Syntax Checking
    6.2 Use of the XML Information Set by Other Specifications
    6.3 XML Processors and the XML Information Set
7 Notation and Terminology
    7.1 Notation
    7.2 Terminology

Appendices

A References
    A.1 Normative References
    A.2 Other References
B Character Classes
C XML and SGML (Non-Normative)
D Autodetection of Character Encodings (Non-Normative)
    D.1 Detection Without External Encoding Information
    D.2 Priorities in the Presence of External Encoding Information
E Production Notes (Non-Normative)


1 Introduction

Extensible Markup Language, abbreviated XML, describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. By construction, XML documents are conforming Standard Generalized Markup Language (SGML) [ISO 8879] documents.

XML documents are made up of characters, some of which form character data, and some of which form markup. Markup primarily encodes a description of the document's logical structure.

[Definition: A software module called an XML processor is used to read XML documents and provide access to their content and structure.] [Definition: It is assumed that an XML processor is doing its work on behalf of another module, called the application.] This specification describes the required behavior of an XML processor in terms of how it must read XML data and (in 5 The Information Set) the information it must provide to the application.

The process at the W3C that led to this document was originated in 1996 by Jon Bosak, and involved a very large number of contributors from within and without the W3C. Lists of contributors may be found in the specifications on which this one is based.

This specification, together with associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 1766 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), provides all the information necessary to understand XML Version SW and construct computer programs to process it.

This version of the XML specification may be distributed freely, as long as all text and legal notices remain intact.

2 Documents, Elements, and Attributes

[Definition: A data object is an XML document if:]

  1. Taken as a whole, it matches the production labeled document.

  2. It meets the further constraints found in the running text, well-formedness constraints, and normative appendices of this specification.

Document
[1]   document   ::=   prolog element Misc*

An example of an XML document:

<Greeting>Hello world!</Greeting>

[Definition: Each XML document contains one or more elements, the boundaries of which are either delimited by start-tags and end-tags, or, for empty elements, by an empty-element tag. Each element has a type, identified by name, sometimes called its "generic identifier" (GI), and may have a set of attribute specifications.] Each attribute specification has a name and a value.

Element
[2]   element   ::=   EmptyElemTag
| STag content ETag   [WFC: Element Type Match]

Well-formedness constraint: Element Type Match

The QName in an element's end-tag must match the element type in the start-tag.
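
As a non-normative illustration of this constraint, the following Python sketch checks that every end-tag names the most recently opened element. It is not a conforming processor: the regular expression and the function name tags_match are assumptions made for this example, and comments, CDATA sections, and processing instructions are deliberately ignored.

```python
import re

# Matches a start-tag, end-tag, or empty-element tag and captures
# the leading "/" (end-tag), the name, and the trailing "/" (empty).
TAG = re.compile(r"<(/?)([^\s<>/]+)[^<>]*?(/?)>")

def tags_match(text):
    """Return True if start-tags and end-tags nest and match by name."""
    open_elements = []
    for m in TAG.finditer(text):
        closing, name, empty = m.group(1), m.group(2), m.group(3)
        if closing:
            # An end-tag must name the element opened most recently.
            if not open_elements or open_elements.pop() != name:
                return False
        elif not empty:
            open_elements.append(name)
    # Every opened element must have been closed.
    return not open_elements

print(tags_match("<Greeting>Hello world!</Greeting>"))
print(tags_match("<a><b></a></b>"))
```

A stack is used because elements must nest properly, so the only end-tag that can legally appear next is the one for the innermost open element.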

An example containing three XML elements:

<Greeting xml:lang="en"><emph>Hello</emph> 
world! <html:img src="smiley.jpg"/></Greeting>

This specification does not constrain the semantics, use, or (beyond syntax) names of the element types and attributes, except that:

  1. Names beginning with a match to (('X'|'x')('M'|'m')('L'|'l')) are reserved for standardization in this or future versions of this specification.

  2. Widely-used semantics are assigned to three attributes whose names begin with "xml:".

2.1 Start-Tags, End-Tags, and Empty-Element Tags

[Definition: The beginning of every non-empty XML element is marked by a start-tag.]

Start-tag
[3]   STag   ::=   '<' QName (S Attribute)* S? '>'   [WFC: Unique Att Spec]
[WFC: Prefix Declared]
[4]   Attribute   ::=   NSAttName Eq AttValue
| QName Eq AttValue   [WFC: Prefix Declared]

The QName in the start- and end-tags gives the element's type. [Definition: The QName-AttValue pairs are referred to as the attribute specifications of the element], [Definition: with the QName in each pair referred to as the attribute name] and [Definition: the content of the AttValue (the text between the ' or " delimiters) as the attribute value.] Note that the order of attribute specifications in a start-tag or empty-element tag is not significant.

An example of a start-tag:

<termdef id="dt-dog" term="dog">

[Definition: The end of every element that begins with a start-tag must be marked by an end-tag containing a name that matches the element's type as given in the start-tag:]

End-tag
[5]   ETag   ::=   '</' QName S? '>'   [WFC: Prefix Declared]

An example of an end-tag:

</termdef>

[Definition: The text between the start-tag and end-tag is called the element's content:]

Content of Elements
[6]   content   ::=   CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*

[Definition: An element with no content is said to be empty.] The representation of an empty element is either a start-tag immediately followed by an end-tag, or an empty-element tag. [Definition: An empty-element tag takes a special form:]

Tags for Empty Elements
[7]   EmptyElemTag   ::=   '<' QName (S Attribute)* S? '/>'   [WFC: Unique Att Spec]
[WFC: Prefix Declared]

Empty-element tags may be used for any element which has no content.

Examples of empty elements:

<IMG align="left"
 src="http://www.example.org/Icons/madonna" />
<br></br>
<br/>

2.2 XML Namespaces

The names that appear as element types and attribute names serve as labels for the logical components of an XML document. [Definition: Software modules are often designed to process a particular set of elements and attributes and their content, identifying them using these labels. Let us refer to such a set, understood by some software module, as a markup vocabulary.]

We envision applications of XML where a single XML document may contain elements and attributes from more than one markup vocabulary. One motivation for this is modularity; if such a markup vocabulary exists which is well-understood and for which there is useful software available, it is better to re-use this markup rather than re-invent it.

Such documents, containing multiple markup vocabularies, pose problems of recognition and collision. Software modules need to be able to recognize the elements and attributes which they are designed to process, even in the face of "collisions" occurring when markup intended for some other software package uses the same element type or attribute name.

These considerations require that document constructs should have universal names, whose scope extends beyond their containing document. This section describes a mechanism, XML namespaces, which accomplishes this.

[Definition: An XML namespace is identified by a URI reference [URIRef]; element types and attribute names may be placed in it using prefixing syntax described in this section.]

[Definition: URI references which identify namespaces are considered identical when they are exactly the same character-for-character.] Note that URI references which are not identical in this sense may in fact be functionally equivalent. Examples include URI references which differ only in case.

Names from XML namespaces may appear as qualified names, which contain a single colon, separating the name into a namespace prefix and a local part. The prefix, which is mapped to a URI reference, selects a namespace. The combination of the universally managed URI namespace and the document's own namespace produces identifiers that are universally unique. Mechanisms are provided for prefix scoping and defaulting.

URI references can contain characters not allowed in names, so cannot be used directly as namespace prefixes. Therefore, the namespace prefix serves as a proxy for a URI reference. An attribute-based syntax described below is used to declare the association of the namespace prefix with a URI reference.

2.2.1 Declaring Namespaces

[Definition: A namespace is declared using a family of reserved attributes. Such an attribute's name must either be xmlns or have xmlns: as a prefix. ]

Here is an example namespace declaration, which associates the namespace prefix eg with the namespace name http://example.com/schema:

<x xmlns:eg='http://example.com/schema'>
  <!-- the "eg" prefix is bound to http://example.com/schema
       for the "x" element and contents -->
</x>

Attribute Names for Namespace Declaration
[8]   NSAttName   ::=   PrefixedAttName
| DefaultAttName
[9]   PrefixedAttName   ::=   'xmlns:' NCName   [WFC: Leading "XML"]
[10]   DefaultAttName   ::=   'xmlns'
[11]   NCName   ::=   (Letter | '_') (NCNameChar)*
[12]   NCNameChar   ::=   Letter | Digit | '.' | '-' | '_' | CombiningChar | Extender

[Definition: The attribute's value, a URI reference, is the namespace name identifying the namespace.] The namespace name, to serve its intended purpose, should have the characteristics of uniqueness and persistence. It is not a goal that it be directly usable for retrieval of a schema (if any exists).

[Definition: If the attribute name matches PrefixedAttName, then the NCName gives the namespace prefix, used to associate element and attribute names with the namespace name in the attribute value in the scope of the element to which the declaration is attached.] In such declarations, the namespace name may not be empty.

[Definition: If the attribute name matches DefaultAttName, then the namespace name in the attribute value is that of the default namespace in the scope of the element to which the declaration is attached.] In such a default declaration, the attribute value may be empty. Default namespaces and overriding of declarations are discussed in 2.2.3 Namespace Declaration Scope and Overriding and 2.2.4 Namespace Defaulting.

2.2.2 Using Qualified Names

[Definition: In XML documents conforming to this specification, element types and attribute names may be given as qualified names (nonterminal QName). ]

[Definition: If the qualified name contains a colon, then the portion before the colon, referred to as the namespace prefix (nonterminal Prefix), provides the link to the namespace.] [Definition: A qualified name need not contain a prefix and colon, but it always contains a local part (nonterminal LocalPart), which appears after the colon if there is one and otherwise makes up the whole of the qualified name.]

If there is a prefix, it must have been associated with a namespace URI reference in a namespace declaration.

An example of a qualified name serving as an element type:

<x xmlns:eg='http://example.com/schema'>
  <!-- the 'price' element's namespace is http://example.com/schema -->
  <eg:price units='Euro'>32.18</eg:price>
</x>

Note that the prefix functions only as a placeholder for a namespace name. Applications should use the namespace name, not the prefix, in constructing names whose scope extends beyond the containing document.
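
The following non-normative Python sketch illustrates why the namespace name, not the prefix, is the stable part of a name. It uses the standard library's xml.etree.ElementTree, which reports element types as "{namespace name}local part"; the second document and the helper name child_names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Two documents that bind different prefixes to the same namespace name.
doc_a = "<x xmlns:eg='http://example.com/schema'><eg:price>32.18</eg:price></x>"
doc_b = "<x xmlns:shop='http://example.com/schema'><shop:price>32.18</shop:price></x>"

def child_names(text):
    """Return the expanded names of the document element's children."""
    return [child.tag for child in ET.fromstring(text)]

# Both calls report the same expanded name; the prefix has disappeared.
print(child_names(doc_a))
print(child_names(doc_b))
```

An application that compared the prefixes "eg" and "shop" would wrongly treat the two price elements as different; comparing the expanded names treats them as the same.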

An example of a qualified name serving as an attribute name:

<x xmlns:eg='http://example.com/schema'>
  <!-- the 'taxClass' attribute's namespace is http://example.com/schema -->
  <lineItem eg:taxClass="exempt">Baby food</lineItem>
</x>

Well-formedness constraint: Prefix Declared

If the QName has a namespace prefix, that prefix, unless it is "xml" or "xmlns", must have been declared in a namespace declaration attribute in either the start-tag of the element where the prefix is used or in an ancestor element (i.e. an element in whose content the prefixed markup occurs). The prefix xml is by definition bound to the namespace name http://www.w3.org/XML/1998/namespace. The prefix "xmlns" is used only for namespace bindings and is not itself bound to any namespace name.

2.2.3 Namespace Declaration Scope and Overriding

The namespace declaration is considered to apply to the element where it is specified and to all elements within the content of that element, unless overridden by another namespace declaration with the same NSAttName part:

<?xml version="SW"?>
<!-- all elements here are explicitly in the HTML namespace -->
<html:html xmlns:html='http://www.w3.org/1999/xhtml'>
  <html:head><html:title>Frobnostication</html:title></html:head>
  <html:body><html:p>Moved to 
    <html:a href='http://frob.com'>here.</html:a></html:p></html:body>
</html:html>

Multiple namespace prefixes can be declared as attributes of a single element, as shown in this example:

<?xml version="SW"?>
<!-- both namespace prefixes are available throughout -->
<bk:book xmlns:bk="http://www.example.com/books/"
         xmlns:isbn="http://www.example.com/isbn/">
    <bk:title>Cheaper by the Dozen</bk:title>
    <isbn:number>1568491379</isbn:number>
</bk:book>

2.2.4 Namespace Defaulting

A default namespace (declared with an attribute named just "xmlns" with no prefix) is considered to apply to the element where it is declared (if that element has no namespace prefix), and to all elements with no prefix within the content of that element. If the URI reference in a default namespace declaration is empty, then unprefixed elements in the scope of the declaration are not considered to be in any namespace. Note that default namespaces do not apply directly to attributes.

<?xml version="SW"?>
<!-- elements are in the HTML namespace, in this case by default -->
<html xmlns='http://www.w3.org/1999/xhtml'>
  <head><title>Frobnostication</title></head>
  <body><p>Moved to 
    <a href='http://example.com/frob/'>here</a>.</p></body>
</html>

Defaulted namespaces can mix with those that are explicitly specified:

<?xml version="SW"?>
<!-- unprefixed element types are from "books" -->
<book xmlns="http://www.example.com/books/"
      xmlns:isbn="http://www.example.com/isbn/">
    <title>Cheaper by the Dozen</title>
    <isbn:number>1568491379</isbn:number>
</book>

A larger example of namespace scoping and defaulting:

<?xml version="SW"?>
<!-- initially, the default namespace is "books" -->
<book xmlns="http://www.example.com/books/"
      xmlns:isbn="http://www.example.com/isbn/">
    <title>Cheaper by the Dozen</title>
    <isbn:number>1568491379</isbn:number>
    <notes>
      <!-- make HTML the default namespace for some commentary -->
      <p xmlns='http://www.w3.org/1999/xhtml'>
          This is a <i>funny</i> book!
      </p>
    </notes>
</book>

The default namespace can be set to the empty string. This has the same effect, within the scope of the declaration, as there being no default namespace.

<?xml version='SW'?>
<Beers>
  <!-- the default namespace is now that of HTML -->
  <table xmlns='http://www.w3.org/TR/REC-html40'>
   <th><td>Name</td><td>Origin</td><td>Description</td></th>
   <tr> 
     <!-- no default namespace inside table cells -->
     <td><brandName xmlns="">Huntsman</brandName></td>
     <td><origin xmlns="">Bath, UK</origin></td>
     <td>
       <details xmlns=""><class>Bitter</class><hop>Fuggles</hop>
         <pro>Wonderful hop, light alcohol, good summer beer</pro>
         <con>Fragile; excessive variance pub to pub</con>
         </details>
        </td>
      </tr>
    </table>
  </Beers>

2.2.5 Uniqueness of Attributes

In XML documents conforming to this specification, no tag may contain two attributes which:

  1. have identical names, or

  2. have qualified names with the same local part and with prefixes which have been bound to namespace names that are identical.

For example, each of the bad start-tags is illegal in the following:

<!-- http://www.example.com/ is bound to n1 and n2 -->
<x xmlns:n1="http://www.example.com/"
   xmlns:n2="http://www.example.com/" >
  <bad a="1"     a="2" />
  <bad n1:a="1"  n2:a="2" />
</x>

However, each of the following is legal, the second because the default namespace does not apply to attribute names:

<!-- http://www.example.com/ is bound to n1 and is the default -->
<x xmlns:n1="http://www.example.com/"
   xmlns="http://www.example.com/" >
  <good a="1"     b="2" />
  <good a="1"     n1:a="2" />
</x>
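
The uniqueness rule can be sketched, non-normatively, as a comparison of expanded names. In the Python below, the function names expand and has_duplicate_attributes and the BINDINGS table are assumptions made for this example; the table mirrors the prefix declarations used in the examples above.

```python
# Prefix-to-namespace-name bindings in scope, as in the examples above.
BINDINGS = {
    "n1": "http://www.example.com/",
    "n2": "http://www.example.com/",
}

def expand(qname, bindings):
    """Map an attribute QName to a (namespace name, local part) pair.

    Unprefixed attribute names are in no namespace, because the
    default namespace does not apply to attributes.
    """
    prefix, colon, local = qname.partition(":")
    if not colon:
        return (None, qname)
    # An undeclared prefix raises KeyError here (compare Prefix Declared).
    return (bindings[prefix], local)

def has_duplicate_attributes(qnames, bindings):
    """Return True if two attributes share the same expanded name."""
    expanded = [expand(q, bindings) for q in qnames]
    return len(set(expanded)) != len(expanded)

print(has_duplicate_attributes(["n1:a", "n2:a"], BINDINGS))
print(has_duplicate_attributes(["a", "n1:a"], BINDINGS))
```

Comparing (namespace name, local part) pairs covers both rules at once: identical unprefixed names collide, and so do differently prefixed names whose prefixes are bound to identical namespace names.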

2.3 Parent, Child, and Root Elements

An XML document matches the document production, which implies that:

  1. It contains one or more elements.

  2. [Definition: There is exactly one element, called the root, or document element, no part of which appears in the content of any other element.] For all other elements, if the start-tag is in the content of another element, the end-tag is in the content of the same element. More simply stated, the elements, delimited by start- and end-tags, nest properly within each other.

[Definition: As a consequence of this, for each non-root element C in the document, there is one other element P in the document such that C is in the content of P, but is not in the content of any other element that is in the content of P. P is referred to as the parent of C, and C as a child of P.]

An example of root, parent, and child elements:

<root>This root element is the parent of the "parent" element.
<parent>This parent element is a child of the "root" element and
parent of the "child" element.
<child>This child element is a child of the "parent" element.</child>
</parent>
</root>
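
A non-normative Python sketch of the parent relation follows. xml.etree.ElementTree elements do not record their parent, so the sketch builds a child-to-parent table; the name parent_of and the abbreviated document are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><parent><child/></parent></root>")

# Each element is the parent of exactly the elements directly in its content.
parent_of = {child: parent for parent in root.iter() for child in parent}

child = root.find("parent/child")
print(parent_of[child].tag)
# The root element has no parent, so it never appears as a key.
print(root in parent_of)
```

Every non-root element appears exactly once as a key, which reflects the statement that each such element has one parent.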

2.4 Reserved Attributes

This section describes several attributes whose names begin "xml:", associating them with predefined semantics useful in a wide variety of applications.

2.4.1 White Space Handling

A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications. This is a common application requirement, for example in poetry and source code. The allowed values of this attribute are "default" and "preserve".

The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute.

The root element of any document is considered to have signaled no intentions as regards application space handling, unless it provides a value for this attribute or the attribute is declared with a default value.

An example of the use of xml:space:

<div>
<p xml:space="default">In this paragraph,
  line breaks and indentation mean nothing.</p>
<p xml:space="preserve">Here, space matters:

      \o/
       |
      / \

</p></div>

2.4.2 Language Identification

In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. The values of the attribute are language identifiers as defined by [IETF RFC 1766], Tags for the Identification of Languages, or its successor on the IETF Standards Track.

Note:

[IETF RFC 1766] tags are constructed from two-letter language codes as defined by [ISO 639], from two-letter country codes as defined by [ISO 3166], or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES]. It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639].

For example:

<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>
<sp who="Faust" desc='leise' xml:lang="de">
  <l>Habe nun, ach! Philosophie,</l>
  <l>Juristerei, und Medizin</l>
  <l>und leider auch Theologie</l>
  <l>durchaus studiert mit heißem Bemüh'n.</l>
</sp>

The intent declared with xml:lang is considered to apply to all attributes and content of the element where it is specified, unless overridden with an instance of xml:lang on another element within that content.
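
The scoping rule can be illustrated with a non-normative Python sketch that finds the xml:lang value in effect for a given element. The function name effective_lang and the sample document are assumptions made for this example; xml.etree.ElementTree reports the attribute under the expanded name shown in XML_LANG.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_lang(root, target):
    """Return the xml:lang value in scope for target, or None."""
    def walk(element, inherited):
        # A local xml:lang overrides whatever was inherited.
        lang = element.get(XML_LANG, inherited)
        if element is target:
            return True, lang
        for child in element:
            found, value = walk(child, lang)
            if found:
                return True, value
        return False, None
    return walk(root, None)[1]

doc = ET.fromstring(
    '<doc xml:lang="en"><p>Hello</p>'
    '<sp xml:lang="de"><l>Habe nun, ach!</l></sp></doc>'
)
print(effective_lang(doc, doc.find("p")))
print(effective_lang(doc, doc.find("sp/l")))
```

The same walk would serve for xml:space, which is inherited in the same way.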

2.4.3 Base URI Specification

This section describes a reserved attribute named xml:base with semantics similar to that of HTML BASE, for defining base URIs for parts of XML documents.

The terms base URI and relative URI are used in this section as they are defined in [RFC2396].

The attribute xml:base may be inserted in XML documents to specify a base URI other than the base URI of the document or external entity. The value of this attribute is interpreted as a URI Reference as defined in RFC 2396 [RFC2396], after the encoding and escaping described at the end of this section.

Here is an example of xml:base in a simple document containing XLinks.

<?xml version="SW"?>
<doc xml:base="http://example.org/today/"
     xmlns:xlink="http://www.w3.org/1999/xlink">
  <head>
    <title>Virtual Library</title>
  </head>
  <body>
    <paragraph>See <link xlink:type="simple" xlink:href="new.xml">what's
      new</link>!</paragraph>
    <paragraph>Check out the hot picks of the day!</paragraph>
    <olist xml:base="/hotpicks/">
      <item>
        <link xlink:type="simple" xlink:href="pick1.xml">Hot Pick #1</link>
      </item>
      <item>
        <link xlink:type="simple" xlink:href="pick2.xml">Hot Pick #2</link>
      </item>
      <item>
        <link xlink:type="simple" xlink:href="pick3.xml">Hot Pick #3</link>
      </item>
    </olist>
  </body>
</doc>

The URIs in this example resolve to full URIs as follows:

  • "what's new" resolves to the URI "http://example.org/today/new.xml"

  • "Hot Pick #1" resolves to the URI "http://example.org/hotpicks/pick1.xml"

  • "Hot Pick #2" resolves to the URI "http://example.org/hotpicks/pick2.xml"

  • "Hot Pick #3" resolves to the URI "http://example.org/hotpicks/pick3.xml"
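
The resolutions listed above can be reproduced with a non-normative Python sketch. It uses urllib.parse.urljoin from the standard library, which follows the later RFC 3986 rather than [RFC2396]; for these inputs the results are the same. The variable names are assumptions made for this example.

```python
from urllib.parse import urljoin

# xml:base on the doc element.
document_base = "http://example.org/today/"
# xml:base="/hotpicks/" on olist is itself resolved against the base in scope.
olist_base = urljoin(document_base, "/hotpicks/")

print(urljoin(document_base, "new.xml"))
print(urljoin(olist_base, "pick1.xml"))
```

Note that the inner xml:base value is a relative reference, so it replaces only the path of the outer base URI and keeps its scheme and host.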

The set of characters allowed in xml:base attributes is the same as for XML, namely [Unicode]. However, some Unicode characters are disallowed from URI references, and thus processors must encode and escape these characters to obtain a valid URI reference from the attribute value.

The disallowed characters include all non-ASCII characters, plus the excluded characters listed in Section 2.4 of [RFC2396], except for the number sign (#) and percent sign (%) characters and the square bracket characters re-allowed in [RFC2732]. Disallowed characters must be escaped as follows:

  1. Each disallowed character is converted to UTF-8 [RFC2279] as one or more bytes.

  2. Any bytes corresponding to a disallowed character are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).

  3. The original character is replaced by the resulting character sequence.
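
The three steps above can be sketched, non-normatively, in Python. The set EXCLUDED_ASCII is this example's reading of the excluded characters of [RFC2396] (space, delimiters, and "unwise" characters) with "#", "%", "[" and "]" left out; control characters and non-ASCII characters are handled by the range test. The function names are assumptions made for this example.

```python
# ASCII characters treated as disallowed in this sketch: space, the
# delimiters < > ", and the "unwise" characters { } | \ ^ `.
EXCLUDED_ASCII = set('<>"{}|\\^` ')

def is_disallowed(ch):
    """True for control characters, DEL, non-ASCII, and EXCLUDED_ASCII."""
    code = ord(ch)
    return code < 0x20 or code > 0x7E or ch in EXCLUDED_ASCII

def escape_uri_reference(value):
    """Escape disallowed characters as %HH over their UTF-8 bytes."""
    out = []
    for ch in value:
        if is_disallowed(ch):
            # Step 1: convert to UTF-8; step 2: escape each byte as %HH.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    # Step 3: the escaped sequences replace the original characters.
    return "".join(out)

print(escape_uri_reference("http://example.org/a b"))
print(escape_uri_reference("caf\u00e9#x"))
```

Because "%" is not escaped, a value that already contains %HH sequences passes through unchanged.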

3 Other Markup

This section describes markup that can appear in an XML document that does not serve to encode the logical structure of the document.

3.1 Prolog and Document Type Declaration

[Definition: XML documents should begin with an XML declaration which specifies the version of XML being used.] For example, the following is a complete XML document.

<?xml version="SW"?> <greeting>Hello, world!</greeting> 

The version number "SW" should be used to indicate conformance to this version of this specification; it is an error for a document to use the value "SW" if it does not conform to this version of this specification. Processors may signal an error if they receive documents labeled with versions they do not support.

Prolog
[13]   prolog   ::=   XMLDecl? Misc* (doctypedecl Misc*)?
[14]   XMLDecl   ::=   '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
[15]   VersionInfo   ::=   S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
[16]   Eq   ::=   S? '=' S?
[17]   VersionNum   ::=   ([a-zA-Z0-9_.:] | '-')+
[18]   Misc   ::=   Comment | PI | S
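
As a non-normative illustration of productions [15] through [17], the following Python sketch extracts the version label from the start of a document. The pattern and the function name declared_version are assumptions made for this example; the sketch inspects only the VersionInfo part and does not validate the rest of the declaration.

```python
import re

# '<?xml' S 'version' Eq, then VersionNum in matching quotes.
VERSION_INFO = re.compile(
    r"""<\?xml[ \t\r\n]+version[ \t\r\n]*=[ \t\r\n]*(["'])([a-zA-Z0-9_.:-]+)\1"""
)

def declared_version(text):
    """Return the declared version label, or None if there is none."""
    m = VERSION_INFO.match(text)
    return m.group(2) if m else None

print(declared_version('<?xml version="SW"?> <greeting>Hello, world!</greeting>'))
```

A processor could compare the returned label with the versions it supports and signal an error for any other value.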

[Definition: For compatibility with XML 1.0, a document type declaration may appear in an XML document before the first element.]

Document Type Declaration
[19]   ExternalID   ::=   'SYSTEM' S SystemLiteral
| 'PUBLIC' S PubidLiteral S SystemLiteral
[20]   doctypedecl   ::=   '<!DOCTYPE' S QName (S ExternalID)? S? '>'

An example of an XML document with a document type declaration:

<?xml version="SW"?> 
<!DOCTYPE greeting SYSTEM "hello.dtd"> 
<greeting>Hello, world!</greeting> 

3.2 Comments

[Definition: Comments may appear anywhere in a document outside other markup; for compatibility, the string "--" (double-hyphen) must not occur within comments.]

Comments
[21]   Comment   ::=   '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

An example of a comment:

<!-- declarations for <head> & <body> -->

Note that the grammar does not allow a comment ending in --->. The following example is not well-formed.

<!-- B+, B, or B--->
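
Production [21] amounts to two conditions on the text between the delimiters: it contains no "--", and it does not end with "-". A non-normative Python sketch of that check follows; the function name is an assumption made for this example, and the check that every character matches Char is omitted.

```python
def is_well_formed_comment(text):
    """Check the delimiters and the double-hyphen rule of a comment."""
    if len(text) < 7 or not (text.startswith("<!--") and text.endswith("-->")):
        return False
    body = text[4:-3]
    # A body ending in "-" would produce "--->", which the grammar forbids.
    return "--" not in body and not body.endswith("-")

print(is_well_formed_comment("<!-- declarations for <head> & <body> -->"))
print(is_well_formed_comment("<!-- B+, B, or B--->"))
```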

3.3 Processing Instructions

[Definition: Processing instructions (PIs) allow documents to contain instructions for applications.]

Processing Instructions
[22]   PI   ::=   '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
[23]   PITarget   ::=   Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

PIs are not part of the document's character data, but must be passed through to the application. [Definition: The PI begins with a target (PITarget) used to identify the application to which the instruction is directed.] The target names "XML", "xml", and so on are reserved for standardization in this or future versions of this specification.

3.4 CDATA Sections

[Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>":]

CDATA Sections
[24]   CDSect   ::=   CDStart CData CDEnd
[25]   CDStart   ::=   '<![CDATA['
[26]   CData   ::=   (Char* - (Char* ']]>' Char*))
[27]   CDEnd   ::=   ']]>'

Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;". CDATA sections cannot nest.

An example of a CDATA section, in which "<greeting>" and "</greeting>" are recognized as character data, not markup:

<![CDATA[<greeting>Hello, world!</greeting>]]> 
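
Because CDATA sections cannot nest and cannot contain "]]>", text that includes that string has to be split across two adjacent sections. The following non-normative Python sketch shows one common way to do this; the function name to_cdata and the sample text are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def to_cdata(text):
    """Wrap text in CDATA sections, splitting any "]]>" between two."""
    # "]]" closes the first section and ">" opens the content of the next.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

sample = "if (a[b[0]]>c) { x = '<y>' & z; }"
wrapped = to_cdata(sample)
print(wrapped)

# Reading the result back yields the original character data.
parsed = ET.fromstring("<r>" + wrapped + "</r>").text
print(parsed == sample)
```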

4 Characters and Text

4.1 Characters

[Definition: XML documents contain text, a sequence of characters, which may represent markup or character data.] [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000]). Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors must accept any character in the range specified for Char. The use of "compatibility characters", as defined in section 6.8 of [Unicode] (see also D21 in section 3.6 of [Unicode3]), is discouraged.]

Character Range
[28]   Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]   /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
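
Production [28] translates directly into a range test on code points. The following Python is a non-normative sketch; the function name is_xml_char is an assumption made for this example.

```python
def is_xml_char(code_point):
    """Return True if the code point matches the Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )

# Tab is legal; NUL, surrogates, and U+FFFE are not.
print([is_xml_char(cp) for cp in (0x9, 0x0, 0xD800, 0xFFFE, 0x10000)])
```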

The same encoding must be used for all the characters in an XML document. All XML processors must accept the UTF-8 and UTF-16 encodings of 10646; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.5 Character Encoding in XML Documents.

4.2 Character References

[Definition: A character reference in an XML document stands for a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices.]

Character Reference
[29]   CharRef   ::=   '&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';'   [WFC: Legal Character]

If the character reference begins with "&#x", the digits and letters up to the terminating ";" provide a hexadecimal representation of the character's code point in ISO/IEC 10646. If it begins just with "&#", the digits up to the terminating ";" provide a decimal representation of the character's code point.
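
A non-normative Python sketch of character reference decoding follows. It accepts the two forms of production [29] and rejects references to code points outside the Char production; the function name decode_char_ref and the use of ValueError are assumptions made for this example.

```python
import re

# Hexadecimal form in group 1, decimal form in group 2.
CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")

def decode_char_ref(ref):
    """Return the character that a character reference stands for."""
    m = CHAR_REF.fullmatch(ref)
    if m is None:
        raise ValueError("not a character reference: %r" % ref)
    code = int(m.group(1), 16) if m.group(1) else int(m.group(2), 10)
    # The referenced character must itself match the Char production.
    legal = (code in (0x9, 0xA, 0xD) or 0x20 <= code <= 0xD7FF
             or 0xE000 <= code <= 0xFFFD or 0x10000 <= code <= 0x10FFFF)
    if not legal:
        raise ValueError("reference to illegal character: %r" % ref)
    return chr(code)

print(decode_char_ref("&#60;") == decode_char_ref("&#x3C;") == "<")
```

Note that only a lowercase "x" introduces the hexadecimal form, while the hexadecimal digits themselves may be in either case.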

[Definition: For readability, a set of predefined patterns is also provided for the purpose of escaping XML's delimiter characters: &amp; for &, &lt; for <, &gt; for >, &apos; for ', and &quot; for ". This has exactly the same effect as using character references: &#60; for <, &#38; for & and so on.]

4.3 Character Data and Markup

Text consists of intermingled character data and markup. [Definition: Markup takes the form of start-tags, end-tags, empty-element tags, character references, comments, CDATA section delimiters, document type declarations, processing instructions, XML declarations, and any white space that is in the document, outside the document element, and not inside any other markup.]

[Definition: All text that is not markup constitutes the character data of the document.]

The ampersand character (&) and the left angle bracket (<) may appear in their literal form only when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup. In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, "]]>".

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".

Character Data
[30]   CharData   ::=   [^<&]* - ([^<&]* ']]>' [^<&]*)
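
The escaping rules of this section can be illustrated with a small Python sketch. The function names escape_text and escape_attribute are assumptions made for the example. The sketch covers only the escapes described above and does not attempt to check that the input consists of legal characters.

```python
def escape_text(data: str) -> str:
    """Escape character data for use in element content."""
    # "&" must be replaced first so that the ampersands introduced by the
    # later replacements are not escaped a second time.
    escaped = data.replace("&", "&amp;").replace("<", "&lt;")
    # ">" only has to be escaped where it would complete the string "]]>".
    return escaped.replace("]]>", "]]&gt;")


def escape_attribute(value: str, quote: str = '"') -> str:
    """Escape a value for use inside an attribute delimited by `quote`."""
    if quote not in ('"', "'"):
        raise ValueError("quote must be a single or double quotation mark")
    escaped = value.replace("&", "&amp;").replace("<", "&lt;")
    if quote == '"':
        return escaped.replace('"', "&quot;")
    return escaped.replace("'", "&apos;")
```

Only the quotation mark actually used as the delimiter is escaped in escape_attribute; escaping both would also be legal but is not required.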

4.4 Common Syntactic Constructs

This section defines some symbols used widely in the grammar.

S (white space) consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.

White Space
[31]   S   ::=   (#x20 | #x9 | #xD | #xA)+

Characters are classified for convenience as letters, digits, or other characters. A letter consists of an alphabetic or syllabic base character or an ideographic character. Full definitions of the specific characters in each class are given in B Character Classes.

[Definition: A Name is a token beginning with a letter or one of a few punctuation characters, and continuing with letters, digits, hyphens, underscores, colons, or full stops, together known as name characters.] Names beginning with the string "xml", or any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.

Names
[32]   NameChar   ::=   Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[33]   Name   ::=   (Letter | '_' | ':') (NameChar)*
[34]   Names   ::=   Name (S Name)*

To support XML namespaces (see 2.2 XML Namespaces), it is necessary to give element types and attribute names as a quoted pair of labels; the Qualified Name (QName) and No-colon Name (NCName) nonterminals support this.

Qualified Name
[35]   QName   ::=    (Prefix ':')? LocalPart
[36]   Prefix   ::=   NCName
[37]   LocalPart   ::=   NCName
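
A minimal Python sketch of how a qualified name divides into its two parts is given below. The function name split_qname is hypothetical, and the sketch checks only the position of the colon; it does not verify that either part is made up of legal name characters.

```python
from typing import Optional, Tuple


def split_qname(qname: str) -> Tuple[Optional[str], str]:
    """Split a QName into (prefix, local part); the prefix is None if absent."""
    if not qname:
        raise ValueError("empty name")
    prefix, colon, local = qname.partition(":")
    if not colon:
        # No colon at all: the whole name is the local part.
        return None, qname
    if not prefix or not local or ":" in local:
        # A No-colon Name cannot be empty and cannot contain a further colon.
        raise ValueError(f"not a qualified name: {qname!r}")
    return prefix, local
```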

Literal data is any quoted string not containing the quotation mark used as a delimiter for that string. Literals are used for specifying the values of attributes (AttValue), and certain components of the document type declaration.

Literals
[38]   AttValue   ::=   '"' ([^<&"] | Reference)* '"'
|  "'" ([^<&'] | Reference)* "'"
[39]   SystemLiteral   ::=   ('"' [^"]* '"') | ("'" [^']* "'")
[40]   PubidLiteral   ::=   '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
[41]   PubidChar   ::=   #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]

4.5 Character Encoding in XML Documents

XML documents must contain Unicode characters, but there are a variety of techniques for encoding characters into bytes for storage. All XML processors must be able to read XML documents in both the UTF-8 and UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification do not apply to character encodings with any other labels, even if the encodings or labels are very similar to UTF-8 or UTF-16.

XML documents encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.

Although an XML processor is required to read only XML documents in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read XML documents that use them. In the absence of external character encoding information (such as MIME headers), XML documents which are stored in an encoding other than UTF-8 or UTF-16 must begin with an XML declaration containing an encoding declaration:

Encoding Declaration
[42]   EncodingDecl   ::=   S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
[43]   EncName   ::=   [A-Za-z] ([A-Za-z0-9._] | '-')*   /* Encoding name contains only Latin characters */

The EncName is the name of the encoding used.
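
Production [43] translates directly into a regular expression. The Python sketch below uses a hypothetical helper name, is_enc_name, and checks syntax only; it says nothing about whether a processor supports the named encoding.

```python
import re

# A Latin letter followed by any number of letters, digits, ".", "_" or "-".
_ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_enc_name(name: str) -> bool:
    """Return True if name matches production [43] EncName."""
    return _ENC_NAME.fullmatch(name) is not None
```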

In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings should use names starting with an "x-" prefix. XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is an error for an XML document including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an XML document which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII XML documents do not strictly need an encoding declaration.

It is a fatal error when an XML processor encounters a data object with an encoding that it is unable to process. It is a fatal error if an XML document is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains octet sequences that are not legal in that encoding. It is also a fatal error if an XML document contains no encoding declaration and its content is not legal UTF-8 or UTF-16.

Examples of XML declarations containing encoding declarations:

<?xml version='SW' encoding='UTF-8'?>
<?xml version="SW" encoding='EUC-JP'?>
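
The rules of this section suggest a simple order of precedence for choosing an encoding, sketched below in Python. The function name detect_encoding and its external parameter are assumptions made for the example. The sketch looks only for a UTF-16 Byte Order Mark and for an encoding declaration written in ASCII-compatible bytes, so it is far from a complete detection procedure.

```python
import re
from typing import Optional

# An XML declaration at the very start of the data that carries an encoding
# declaration; the name pattern follows production [43] EncName.
_DECL = re.compile(
    rb"""<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def detect_encoding(data: bytes, external: Optional[str] = None) -> str:
    """Guess the encoding name of an XML document held in `data`."""
    if external is not None:
        # Information from a transport protocol (e.g. HTTP or MIME) wins.
        return external
    if data.startswith(b"\xfe\xff") or data.startswith(b"\xff\xfe"):
        # Byte Order Mark in either byte order signals UTF-16.
        return "UTF-16"
    match = _DECL.match(data)
    if match is not None:
        return match.group(1).decode("ascii")
    # Neither a Byte Order Mark nor an encoding declaration: UTF-8.
    return "UTF-8"
```

A real processor would go on to check that the bytes are legal in the chosen encoding, and would treat a mismatch or an unsupported encoding as a fatal error.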

5 The Information Set

This section defines an abstract data set called the XML Information Set (Infoset). It exists to provide:

  1. A consistent set of definitions for use in other specifications that need to refer to the information in an XML document.

  2. A specification of the information an XML Processor must provide to an Application.

The contents of the information set for an XML document are designed to convey its structure and content as expressed by its markup and character data. However, there are some items of markup which have no effect on the contents of the information set: examples include CDATA sections and character references.

[Definition: An XML document's information set consists of a number of information items; the information set for any XML document contains at least a document information item and several others.] [Definition: An information item is an abstract description of some part of an XML document: each information item has a set of associated named properties.]

The XML Information Set does not require or favor a specific interface or class of interfaces. This specification presents the information set as a modified tree for the sake of clarity and simplicity, but there is no requirement that the XML Information Set be made available through a tree structure; other types of interfaces, including (but not limited to) event-based and query-based interfaces, are also capable of providing information conforming to the XML Information Set.

The terms "information set" and "information item" are similar in meaning to the generic terms "tree" and "node", as they are used in computing. However, the former terms are used in this specification to reduce possible confusion with other specific data models. Information items do not map one-to-one with the nodes of the DOM or the "tree" and "nodes" of the XPath data model.

5.1 Base URI

Several information items have a "base URI" property. These are computed as specified in 2.4.3 Base URI Specification. Note that retrieval of a resource may involve redirection at the parser level (for example, in an entity resolver) or below; in this case the base URI is the final URI used to retrieve the resource after all redirection.

The value of these properties does not reflect any URI escaping that may be required for retrieval of the resource, but it may include escaped characters if these were specified in the XML document, or returned by a server in the case of redirection.

In some cases (such as a document read from a string or a pipe) the rules in 2.4.3 Base URI Specification may result in a base URI being application dependent. In these cases this specification does not define the value of the "base URI" property.

When resolving relative URIs the "base URI" property should be used in preference to the values of xml:base attributes; they may be inconsistent in the case of Synthetic Infosets.

5.4 End-of-Line Handling

XML documents are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters carriage-return (#xD) and line-feed (#xA).

To simplify the tasks of applications, the information set items corresponding to line ends in the character data and attribute values of an XML document appear in a form "normalized" as follows: all appearances of either the literal two-character sequence "#xD#xA" or a standalone literal #xD are normalized into a single character information item for a #xA character.
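
In Python, this normalization could be sketched as two replacements applied in a fixed order. The function name normalize_line_ends is an assumption made for the example.

```python
def normalize_line_ends(text: str) -> str:
    """Turn every #xD#xA pair and every standalone #xD into a single #xA."""
    # The two-character sequence must be handled first; otherwise the
    # second replacement would turn each pair into two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```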

5.5 Information Items

5.5.1 The Document Information Item

[Definition: There is exactly one document information item in the information set, and all other information items are accessible from the properties of the document information item, either directly or indirectly through the properties of other information items.]

The document information item has the following properties:

  1. children: An ordered list of child information items, in document order. The list contains exactly one element information item, for the document element. The list also contains one processing instruction information item for each processing instruction outside the document element, and one comment information item for each comment outside the document element. If there is a document type declaration, the list also contains a document type declaration information item.

  2. document element: The element information item corresponding to the document element.

  3. base URI: The base URI of the XML document.

  4. character encoding scheme: The name of the character encoding scheme in which the XML document is expressed (see 4.5 Character Encoding in XML Documents).

  5. version: A string representing the XML version of the XML document. This property is derived from the XML declaration optionally present at the beginning of the document entity, and has no value if there is no XML declaration.

5.5.2 Element Information Items

[Definition: There is an element information item for each element appearing in the XML document. One of the element information items is the value of the document element property of the document information item, corresponding to the root of the element tree, and all other element information items are accessible by recursively following its "children" property.]

An element information item has the following properties:

  1. namespace name: The namespace name, if any, of the element type, determined as specified in 2.2 XML Namespaces. If the element does not belong to a namespace, this property has no value.

  2. local name: The local part of the element type. This does not include the namespace prefix (if any) or following colon.

  3. prefix: The namespace prefix part of the element type. If the type is unprefixed, this property has no value. Note that applications should use the namespace name rather than the prefix to identify elements.

  4. children: An ordered list of child information items, in document order. This list contains element, processing instruction, character, and comment information items, one for each element, processing instruction, character, and comment appearing immediately within the element. If the element is empty, this list has no members.

  5. attributes: An unordered set of attribute information items, one for each of the attributes of the element. Namespace declarations do not appear in this set. If the element has no attributes, this set has no members.

  6. namespace attributes: An unordered set of attribute information items, one for each of the namespace declarations attached to this element. A declaration of the form xmlns="", which undeclares the default namespace, counts as a namespace declaration. By definition, all namespace attributes (including those named xmlns, whose "prefix" property has no value) have a namespace URI of http://www.w3.org/2000/xmlns/. If the element has no namespace declarations, this set has no members.

  7. in-scope namespaces: An unordered set of namespace information items, one for each of the namespaces in effect for this element. This set always contains an item with the prefix xml which is by definition bound to the namespace name http://www.w3.org/XML/1998/namespace. It does not contain an item with the prefix xmlns (used for declaring namespaces), since an application can never encounter an element or attribute with that prefix. The set includes namespace items corresponding to all of the members of the "namespace attributes" property, except for any representing a declaration of the form xmlns="", which does not declare a namespace but rather undeclares the default namespace. When resolving the prefixes of qualified names this property should be used in preference to the "namespace attributes" property, which may be inconsistent in the case of Synthetic Infosets.

  8. base URI: The base URI of the element.

  9. parent: The document or element information item which contains this information item in its "children" property.

5.5.3 Attribute Information Items

[Definition: There is an attribute information item for each attribute of each element in an XML document, including those which are namespace declarations. The latter however appear as members of an element's "namespace attributes" property rather than its "attributes" property.]

Attribute values appear in the information set in a "normalized" form, not necessarily identical to the form which appears in the XML document. Normalization is accomplished by applying the algorithm below, or by using some other method that produces the same result.

  1. All line breaks must have been normalized on input to #xA as described in 5.4 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.

  2. Begin with a normalized value consisting of the empty string.

  3. For each character in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:

    • For a character reference, append the referenced character to the normalized value.

    • For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.

    • For another character, append the character to the normalized value.

Note that if the unnormalized attribute value contains a character reference to a white space character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a white space character (not a reference), which is replaced with a space character (#x20) in the normalized value.

Following are examples of attribute normalization. The attribute specifications in the left column below would be normalized to the character sequences of the right column.

Attribute specification (the value contains two literal line breaks before "xyz"):
a="

xyz"
Normalized sequence:
#x20 #x20 x y z
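
The normalization algorithm above can be sketched in Python as shown below. The function name normalize_attribute_value is hypothetical. The sketch treats the five predefined patterns (&amp;, &lt;, &gt;, &apos;, &quot;) like character references, on the assumption that they have the same effect, and it does not report malformed input such as a bare ampersand.

```python
import re

_PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "apos": "'", "quot": '"'}

# One token per step: a hexadecimal reference, a decimal reference,
# a predefined pattern, or any single character.
_TOKEN = re.compile(
    r"&#x([0-9a-fA-F]+);|&#([0-9]+);|&(amp|lt|gt|apos|quot);|(.)", re.DOTALL
)


def normalize_attribute_value(raw: str) -> str:
    """Normalize the text found between the quotes of an attribute."""
    # Step 1: line ends are normalized to #xA before anything else.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Step 2: begin with an empty normalized value.
    out = []
    # Step 3: walk the unnormalized value from first to last.
    for match in _TOKEN.finditer(text):
        hex_digits, dec_digits, name, char = match.groups()
        if hex_digits is not None:
            out.append(chr(int(hex_digits, 16)))   # referenced character, as is
        elif dec_digits is not None:
            out.append(chr(int(dec_digits, 10)))
        elif name is not None:
            out.append(_PREDEFINED[name])
        elif char in " \t\n\r":
            out.append(" ")                        # literal white space becomes #x20
        else:
            out.append(char)
    return "".join(out)
```

Applied to the example above (two literal line breaks followed by xyz), the sketch returns two spaces followed by xyz, whereas a value written as a&#xA;b keeps the line feed between a and b.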

An attribute information item has the following properties:

  1. namespace name: The namespace name, if any, of the attribute. Otherwise, this property has no value.

  2. local name: The local part of the attribute name. This does not include any namespace prefix or following colon.

  3. prefix: The namespace prefix part of the attribute name. If the name is unprefixed, this property has no value. Note that applications should use the namespace name rather than the prefix to identify attributes.

  4. normalized value: The attribute value, normalized as described above.

  5. owner element: The element information item which contains this information item in its "attributes" property.

5.5.4 Processing Instruction Information Items

[Definition: There is a processing instruction information item for each processing instruction in an XML document. The XML declaration is not considered to be a processing instruction.]

A processing instruction information item has the following properties:

  1. target: A string representing the target part of the processing instruction (an XML name).

  2. content: A string representing the content of the processing instruction, excluding the target and any white space immediately following it. If there is no such content, the value of this property is an empty string.

  3. parent: The document, element, or document type declaration information item which contains this information item in its "children" property.
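
As an illustration of the target and content properties, the Python sketch below splits the text found between the "<?" and "?>" delimiters of a processing instruction. The function name split_pi is hypothetical, and the sketch assumes that its input has already been recognized as a processing instruction; it does not check that the target is a legal name.

```python
import re
from typing import Tuple

# Target: a run of non-white-space characters. Content: everything after the
# white space that follows the target. White space is as in production [31] S.
_PI_BODY = re.compile(r"([^\x20\x09\x0D\x0A]+)(?:[\x20\x09\x0D\x0A]+(.*))?", re.DOTALL)


def split_pi(body: str) -> Tuple[str, str]:
    """Return (target, content) for the text between "<?" and "?>"."""
    match = _PI_BODY.fullmatch(body)
    if match is None:
        raise ValueError("processing instruction has no target")
    target, content = match.groups()
    # With no content, the property is the empty string rather than absent.
    return target, content if content is not None else ""
```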

5.5.5 Character Information Item

[Definition: There is a character information item for each character that appears within character data in the content of an element, whether literally or as a character reference.]

A character information item has the following properties:

  1. character code: The ISO 10646 character code (in the range 0 to #x10FFFF, though not every value in this range is a legal XML character code) of the character.

  2. parent: The element information item which contains this information item in its "children" property.

5.5.6 Comment Information Items

[Definition: There is optionally a comment information item for each XML comment in an XML document. XML processors are allowed to ignore comments and are not required to provide comment information items.]

A comment information item has the following properties:

  1. content: A string representing the content of the comment.

  2. parent: The document or element information item which contains this information item in its "children" property.

5.5.7 The Document Type Declaration Information Item

[Definition: If the XML document has a document type declaration, then the information set contains a single document type declaration information item.]

A document type declaration information item has the following properties:

  1. system identifier: The SystemLiteral, if one appears in the document type declaration, without any additional URI escaping applied by the processor.

  2. public identifier: The PubidLiteral in the document type declaration, if one is provided, after being processed by replacing each string of white space with a single space character (#x20), and removing leading and trailing white space.

  3. parent: The document information item.
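
The processing applied to the public identifier might be sketched in Python as follows. The function name normalize_public_id is an assumption, and white space is taken to mean the three white space characters that PubidChar admits (#x20, #xD, and #xA).

```python
import re


def normalize_public_id(literal: str) -> str:
    """Collapse runs of white space to one space and trim both ends."""
    collapsed = re.sub(r"[\x20\x0D\x0A]+", " ", literal)
    return collapsed.strip(" ")
```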

5.5.8 Namespace Information Items

[Definition: Each element in an XML document has a namespace information item for each namespace that is in scope for that element.]

A namespace information item has the following properties:

  1. prefix: The prefix whose binding this item describes. Syntactically, this is the part of the attribute name following the xmlns: prefix. If the attribute name is simply xmlns, so that the declaration is of the default namespace, this property has no value.

  2. namespace name: The namespace name to which the prefix is bound.
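
The relationship between an element's namespace declarations and its in-scope namespaces can be illustrated with a short Python sketch. The function name in_scope_namespaces and the use of a dictionary keyed by prefix (with None standing for the default namespace) are assumptions made for the example, not a required interface.

```python
from typing import Dict, Optional

XML_NS = "http://www.w3.org/XML/1998/namespace"

Scope = Dict[Optional[str], str]


def in_scope_namespaces(parent_scope: Optional[Scope], declarations: Scope) -> Scope:
    """Compute an element's in-scope namespaces from its parent's scope and
    its own namespace declarations."""
    # The prefix "xml" is always bound, even on the document element.
    scope: Scope = dict(parent_scope) if parent_scope is not None else {"xml": XML_NS}
    for prefix, name in declarations.items():
        if prefix is None and name == "":
            # xmlns="" undeclares the default namespace instead of binding it.
            scope.pop(None, None)
        else:
            scope[prefix] = name
    return scope
```

A child element inherits every binding of its parent unless it overrides or, for the default namespace, undeclares it.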

6 Conformance

6.1 Syntax Checking

Conforming XML processors must detect and report violations of this specification's grammar and well-formedness constraints in the content of data objects (which, if such violations exist, are by definition not XML documents).

When any such violation or any other fatal error is encountered, the XML processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing - i.e. it must not continue making Infoset items available to the application.

6.2 Use of the XML Information Set by Other Specifications

One of the purposes of the Information Set (see 5 The Information Set) is to provide a set of definitions for use by other specifications.

Specifications conformant to this specification, when referring to the Infoset, must:

  • Indicate the information items and properties that are needed to implement the specification.

  • Specify how other information items and properties are treated (for example, they might be passed through unchanged).

  • Note any information required from an XML document that is not defined by the Infoset.

  • Note any difference in the use of terms defined by the Infoset (this should be avoided).

6.3 XML Processors and the XML Information Set

Conforming XML processors must provide a mechanism to make the information items from the information set available to applications with the characteristics described in 5 The Information Set.

7 Notation and Terminology

7.1 Notation

The formal grammar of XML is given in this specification using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form

symbol ::= expression

Symbols are written with an initial capital letter if they are the start symbol of a regular language, otherwise with an initial lower case letter. Literal strings are quoted.

Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:

#xN

where N is a hexadecimal integer, the expression matches the character in ISO/IEC 10646 whose canonical (UCS-4) code value, when interpreted as an unsigned binary number, has the value indicated. The number of leading zeros in the #xN form is insignificant; the number of leading zeros in the corresponding code value is governed by the character encoding in use and is not significant for XML.

[a-zA-Z], [#xN-#xN]

matches any Char with a value in the range(s) indicated (inclusive).

[abc], [#xN#xN#xN]

matches any Char with a value among the characters enumerated. Enumerations and ranges can be mixed in one set of brackets.

[^a-z], [^#xN-#xN]

matches any Char with a value outside the range indicated.

[^abc], [^#xN#xN#xN]

matches any Char with a value not among the characters given. Enumerations and ranges of forbidden values can be mixed in one set of brackets.

"string"

matches a literal string matching that given inside the double quotes.

'string'

matches a literal string matching that given inside the single quotes.

These symbols may be combined to match more complex patterns as follows, where A and B represent simple expressions:

(expression)

expression is treated as a unit and may be combined as described in this list.

A?

matches A or nothing; optional A.

A B

matches A followed by B. This operator has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D).

A | B

matches A or B but not both.

A - B

matches any string that matches A but does not match B.

A+

matches one or more occurrences of A. Concatenation has higher precedence than alternation; thus A+ | B+ is identical to (A+) | (B+).

A*

matches zero or more occurrences of A. Concatenation has higher precedence than alternation; thus A* | B* is identical to (A*) | (B*).

Other notations used in the productions are:

/* ... */

comment.

[ wfc: ... ]

well-formedness constraint; this identifies by name a constraint associated with a grammar production, violation of which is a fatal error.

7.2 Terminology

The terminology used to describe XML documents is defined in the body of this specification. The terms defined in the following list are used in building those definitions and in describing the actions of an XML processor:

may

[Definition: Conforming documents and XML processors are permitted to but need not behave as described.]

must

[Definition: Conforming documents and XML processors are required to behave as described; otherwise they are in error.]

error

[Definition: A violation of the rules of this specification; results are undefined. Conforming software may detect and report an error and may recover from it.]

fatal error

[Definition: An error which a conforming XML processor must detect and report to the application.]

at user option

[Definition: Conforming software may or must (depending on the modal verb in the sentence) behave as described; if it does, it must provide users a means to enable or disable the behavior described.]

well-formedness constraint

[Definition: A rule which applies to all XML documents. Violations of well-formedness constraints are fatal errors.]

match

[Definition: (Of strings or names:) Two strings or names being compared must be identical. Characters with multiple possible representations in ISO/IEC 10646 (e.g. characters with both precomposed and base+diacritic forms) match only if they have the same representation in both strings. No case folding is performed. (Of strings and rules in the grammar:) A string matches a grammatical production if it belongs to the language generated by that production.]

for compatibility

[Definition: Marks a sentence describing a feature of XML included solely to ensure that XML remains compatible with SGML.]
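
The definition of match given above, for strings and names, amounts to a comparison of code point sequences. The Python sketch below, with the hypothetical function name xml_match, shows that two canonically equivalent spellings of the same letter do not match, and that case is significant.

```python
import unicodedata


def xml_match(a: str, b: str) -> bool:
    """Two strings match only if they are identical, code point for code point."""
    return a == b


precomposed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"      # "e" followed by COMBINING ACUTE ACCENT

# Unicode normalization shows that the two spellings are canonically equivalent...
assert unicodedata.normalize("NFC", decomposed) == precomposed
# ...but they do not match, because their representations differ.
assert not xml_match(precomposed, decomposed)
# No case folding is performed.
assert not xml_match("Name", "name")
```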

A References

A.2 Other References

Aho/Ullman
Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Reading: Addison-Wesley, 1986, rpt. corr. 1988.
Berners-Lee et al.
Berners-Lee, T., R. Fielding, and L. Masinter. Uniform Resource Identifiers (URI): Generic Syntax and Semantics. 1997. (Work in progress; see updates to RFC1738.)
Clark
James Clark. Comparison of SGML and XML. See http://www.w3.org/TR/NOTE-sgml-xml-971215.
IANA-LANGCODES
IANA (Internet Assigned Numbers Authority). Registry of Language Tags, ed. Keld Simonsen et al. (See http://www.isi.edu/in-notes/iana/assignments/languages/.)
IETF RFC1738
IETF (Internet Engineering Task Force). RFC 1738: Uniform Resource Locators (URL), ed. T. Berners-Lee, L. Masinter, M. McCahill. 1994. (See http://www.ietf.org/rfc/rfc1738.txt.)
IETF RFC1808
IETF (Internet Engineering Task Force). RFC 1808: Relative Uniform Resource Locators, ed. R. Fielding. 1995. (See http://www.ietf.org/rfc/rfc1808.txt.)
IETF RFC2141
IETF (Internet Engineering Task Force). RFC 2141: URN Syntax, ed. R. Moats. 1997. (See http://www.ietf.org/rfc/rfc2141.txt.)
IETF RFC 2279
IETF (Internet Engineering Task Force). RFC 2279: UTF-8, a transformation format of ISO 10646, ed. F. Yergeau. 1998. (See http://www.ietf.org/rfc/rfc2279.txt.)
IETF RFC 2376
IETF (Internet Engineering Task Force). RFC 2376: XML Media Types, ed. E. Whitehead, M. Murata. 1998. (See http://www.ietf.org/rfc/rfc2376.txt.)
IETF RFC 2396
IETF (Internet Engineering Task Force). RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax. T. Berners-Lee, R. Fielding, L. Masinter. 1998. (See http://www.ietf.org/rfc/rfc2396.txt.)
IETF RFC 2732
IETF (Internet Engineering Task Force). RFC 2732: Format for Literal IPv6 Addresses in URL's. R. Hinden, B. Carpenter, L. Masinter. 1999. (See http://www.ietf.org/rfc/rfc2732.txt.)
IETF RFC 2781
IETF (Internet Engineering Task Force). RFC 2781: UTF-16, an encoding of ISO 10646, ed. P. Hoffman, F. Yergeau. 2000. (See http://www.ietf.org/rfc/rfc2781.txt.)
ISO 639
ISO (International Organization for Standardization). ISO 639:1988 (E). Code for the representation of names of languages. [Geneva]: International Organization for Standardization, 1988.
ISO 3166
ISO (International Organization for Standardization). ISO 3166-1:1997 (E). Codes for the representation of names of countries and their subdivisions -- Part 1: Country codes. [Geneva]: International Organization for Standardization, 1997.
ISO 8879
ISO (International Organization for Standardization). ISO 8879:1986(E). Information processing -- Text and Office Systems -- Standard Generalized Markup Language (SGML). First edition -- 1986-10-15. [Geneva]: International Organization for Standardization, 1986.
ISO/IEC 10744
ISO (International Organization for Standardization). ISO/IEC 10744-1992 (E). Information technology -- Hypermedia/Time-based Structuring Language (HyTime). [Geneva]: International Organization for Standardization, 1992. Extended Facilities Annexe. [Geneva]: International Organization for Standardization, 1996.
WEBSGML
ISO (International Organization for Standardization). ISO 8879:1986 TC2. Information technology -- Document Description and Processing Languages. [Geneva]: International Organization for Standardization, 1998. (See http://www.sgmlsource.com/8879rev/n0029.htm.)
XML 2e
Tim Bray, Jean Paoli, C.M. Sperberg-McQueen, and Eve Maler, editors. Extensible Markup Language (XML) 1.0 (Second Edition). World Wide Web Consortium, 2000. (See http://www.w3.org/TR/2000/REC-xml-20001006.)
XML Base
Jonathan Marsh, editor. XML Base. World Wide Web Consortium, 2001. (See http://www.w3.org/TR/2001/REC-xmlbase-20010627/.)
XML Infoset
John Cowan and Richard Tobin, editors. XML Information Set. World Wide Web Consortium, 2001. (See http://www.w3.org/TR/2001/REC-xml-infoset-20011024/.)
XML Names
Tim Bray, Dave Hollander, and Andrew Layman, editors. Namespaces in XML. Textuality, Hewlett-Packard, and Microsoft. World Wide Web Consortium, 1999. (See http://www.w3.org/TR/1999/REC-xml-names-19990114/.)

B Character Classes

Following the characteristics defined in the Unicode standard, characters are classed as base characters (among others, these contain the alphabetic characters of the Latin alphabet), ideographic characters, and combining characters (among others, this class contains most diacritics). Digits and extenders are also distinguished.

Characters
[44]   Letter   ::=   BaseChar | Ideographic
[45]   BaseChar   ::=   [#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6] | [#x00D8-#x00F6] | [#x00F8-#x00FF] | [#x0100-#x0131] | [#x0134-#x013E] | [#x0141-#x0148] | [#x014A-#x017E] | [#x0180-#x01C3] | [#x01CD-#x01F0] | [#x01F4-#x01F5] | [#x01FA-#x0217] | [#x0250-#x02A8] | [#x02BB-#x02C1] | #x0386 | [#x0388-#x038A] | #x038C | [#x038E-#x03A1] | [#x03A3-#x03CE] | [#x03D0-#x03D6] | #x03DA | #x03DC | #x03DE | #x03E0 | [#x03E2-#x03F3] | [#x0401-#x040C] | [#x040E-#x044F] | [#x0451-#x045C] | [#x045E-#x0481] | [#x0490-#x04C4] | [#x04C7-#x04C8] | [#x04CB-#x04CC] | [#x04D0-#x04EB] | [#x04EE-#x04F5] | [#x04F8-#x04F9] | [#x0531-#x0556] | #x0559 | [#x0561-#x0586] | [#x05D0-#x05EA] | [#x05F0-#x05F2] | [#x0621-#x063A] | [#x0641-#x064A] | [#x0671-#x06B7] | [#x06BA-#x06BE] | [#x06C0-#x06CE] | [#x06D0-#x06D3] | #x06D5 | [#x06E5-#x06E6] | [#x0905-#x0939] | #x093D | [#x0958-#x0961] | [#x0985-#x098C] | [#x098F-#x0990] | [#x0993-#x09A8] | [#x09AA-#x09B0] | #x09B2 | [#x09B6-#x09B9] | [#x09DC-#x09DD] | [#x09DF-#x09E1] | [#x09F0-#x09F1] | [#x0A05-#x0A0A] | [#x0A0F-#x0A10] | [#x0A13-#x0A28] | [#x0A2A-#x0A30] | [#x0A32-#x0A33] | [#x0A35-#x0A36] | [#x0A38-#x0A39] | [#x0A59-#x0A5C] | #x0A5E | [#x0A72-#x0A74] | [#x0A85-#x0A8B] | #x0A8D | [#x0A8F-#x0A91] | [#x0A93-#x0AA8] | [#x0AAA-#x0AB0] | [#x0AB2-#x0AB3] | [#x0AB5-#x0AB9] | #x0ABD | #x0AE0 | [#x0B05-#x0B0C] | [#x0B0F-#x0B10] | [#x0B13-#x0B28] | [#x0B2A-#x0B30] | [#x0B32-#x0B33] | [#x0B36-#x0B39] | #x0B3D | [#x0B5C-#x0B5D] | [#x0B5F-#x0B61] | [#x0B85-#x0B8A] | [#x0B8E-#x0B90] | [#x0B92-#x0B95] | [#x0B99-#x0B9A] | #x0B9C | [#x0B9E-#x0B9F] | [#x0BA3-#x0BA4] | [#x0BA8-#x0BAA] | [#x0BAE-#x0BB5] | [#x0BB7-#x0BB9] | [#x0C05-#x0C0C] | [#x0C0E-#x0C10] | [#x0C12-#x0C28] | [#x0C2A-#x0C33] | [#x0C35-#x0C39] | [#x0C60-#x0C61] | [#x0C85-#x0C8C] | [#x0C8E-#x0C90] | [#x0C92-#x0CA8] | [#x0CAA-#x0CB3] | [#x0CB5-#x0CB9] | #x0CDE | [#x0CE0-#x0CE1] | [#x0D05-#x0D0C] | [#x0D0E-#x0D10] | [#x0D12-#x0D28] | [#x0D2A-#x0D39] | [#x0D60-#x0D61] | [#x0E01-#x0E2E] | 
#x0E30 | [#x0E32-#x0E33] | [#x0E40-#x0E45] | [#x0E81-#x0E82] | #x0E84 | [#x0E87-#x0E88] | #x0E8A | #x0E8D | [#x0E94-#x0E97] | [#x0E99-#x0E9F] | [#x0EA1-#x0EA3] | #x0EA5 | #x0EA7 | [#x0EAA-#x0EAB] | [#x0EAD-#x0EAE] | #x0EB0 | [#x0EB2-#x0EB3] | #x0EBD | [#x0EC0-#x0EC4] | [#x0F40-#x0F47] | [#x0F49-#x0F69] | [#x10A0-#x10C5] | [#x10D0-#x10F6] | #x1100 | [#x1102-#x1103] | [#x1105-#x1107] | #x1109 | [#x110B-#x110C] | [#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C | #x114E | #x1150 | [#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163 | #x1165 | #x1167 | #x1169 | [#x116D-#x116E] | [#x1172-#x1173] | #x1175 | #x119E | #x11A8 | #x11AB | [#x11AE-#x11AF] | [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB | #x11F0 | #x11F9 | [#x1E00-#x1E9B] | [#x1EA0-#x1EF9] | [#x1F00-#x1F15] | [#x1F18-#x1F1D] | [#x1F20-#x1F45] | [#x1F48-#x1F4D] | [#x1F50-#x1F57] | #x1F59 | #x1F5B | #x1F5D | [#x1F5F-#x1F7D] | [#x1F80-#x1FB4] | [#x1FB6-#x1FBC] | #x1FBE | [#x1FC2-#x1FC4] | [#x1FC6-#x1FCC] | [#x1FD0-#x1FD3] | [#x1FD6-#x1FDB] | [#x1FE0-#x1FEC] | [#x1FF2-#x1FF4] | [#x1FF6-#x1FFC] | #x2126 | [#x212A-#x212B] | #x212E | [#x2180-#x2182] | [#x3041-#x3094] | [#x30A1-#x30FA] | [#x3105-#x312C] | [#xAC00-#xD7A3]
[46]   Ideographic   ::=   [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]
[47]   CombiningChar   ::=   [#x0300-#x0345] | [#x0360-#x0361] | [#x0483-#x0486] | [#x0591-#x05A1] | [#x05A3-#x05B9] | [#x05BB-#x05BD] | #x05BF | [#x05C1-#x05C2] | #x05C4 | [#x064B-#x0652] | #x0670 | [#x06D6-#x06DC] | [#x06DD-#x06DF] | [#x06E0-#x06E4] | [#x06E7-#x06E8] | [#x06EA-#x06ED] | [#x0901-#x0903] | #x093C | [#x093E-#x094C] | #x094D | [#x0951-#x0954] | [#x0962-#x0963] | [#x0981-#x0983] | #x09BC | #x09BE | #x09BF | [#x09C0-#x09C4] | [#x09C7-#x09C8] | [#x09CB-#x09CD] | #x09D7 | [#x09E2-#x09E3] | #x0A02 | #x0A3C | #x0A3E | #x0A3F | [#x0A40-#x0A42] | [#x0A47-#x0A48] | [#x0A4B-#x0A4D] | [#x0A70-#x0A71] | [#x0A81-#x0A83] | #x0ABC | [#x0ABE-#x0AC5] | [#x0AC7-#x0AC9] | [#x0ACB-#x0ACD] | [#x0B01-#x0B03] | #x0B3C | [#x0B3E-#x0B43] | [#x0B47-#x0B48] | [#x0B4B-#x0B4D] | [#x0B56-#x0B57] | [#x0B82-#x0B83] | [#x0BBE-#x0BC2] | [#x0BC6-#x0BC8] | [#x0BCA-#x0BCD] | #x0BD7 | [#x0C01-#x0C03] | [#x0C3E-#x0C44] | [#x0C46-#x0C48] | [#x0C4A-#x0C4D] | [#x0C55-#x0C56] | [#x0C82-#x0C83] | [#x0CBE-#x0CC4] | [#x0CC6-#x0CC8] | [#x0CCA-#x0CCD] | [#x0CD5-#x0CD6] | [#x0D02-#x0D03] | [#x0D3E-#x0D43] | [#x0D46-#x0D48] | [#x0D4A-#x0D4D] | #x0D57 | #x0E31 | [#x0E34-#x0E3A] | [#x0E47-#x0E4E] | #x0EB1 | [#x0EB4-#x0EB9] | [#x0EBB-#x0EBC] | [#x0EC8-#x0ECD] | [#x0F18-#x0F19] | #x0F35 | #x0F37 | #x0F39 | #x0F3E | #x0F3F | [#x0F71-#x0F84] | [#x0F86-#x0F8B] | [#x0F90-#x0F95] | #x0F97 | [#x0F99-#x0FAD] | [#x0FB1-#x0FB7] | #x0FB9 | [#x20D0-#x20DC] | #x20E1 | [#x302A-#x302F] | #x3099 | #x309A
[48]   Digit   ::=   [#x0030-#x0039] | [#x0660-#x0669] | [#x06F0-#x06F9] | [#x0966-#x096F] | [#x09E6-#x09EF] | [#x0A66-#x0A6F] | [#x0AE6-#x0AEF] | [#x0B66-#x0B6F] | [#x0BE7-#x0BEF] | [#x0C66-#x0C6F] | [#x0CE6-#x0CEF] | [#x0D66-#x0D6F] | [#x0E50-#x0E59] | [#x0ED0-#x0ED9] | [#x0F20-#x0F29]
[49]   Extender   ::=   #x00B7 | #x02D0 | #x02D1 | #x0387 | #x0640 | #x0E46 | #x0EC6 | #x3005 | [#x3031-#x3035] | [#x309D-#x309E] | [#x30FC-#x30FE]

The character classes defined here can be derived from the Unicode 2.0 character database as follows:

  • Name start characters must have one of the categories Ll, Lu, Lo, Lt, Nl.

  • Name characters other than Name-start characters must have one of the categories Mc, Me, Mn, Lm, or Nd.

  • Characters in the compatibility area (i.e. with character code greater than #xF900 and less than #xFFFE) are not allowed in XML names.

  • Characters which have a font or compatibility decomposition (i.e. those with a "compatibility formatting tag" in field 5 of the database -- marked by field 5 beginning with a "<") are not allowed.

  • The following characters are treated as name-start characters rather than name characters, because the property file classifies them as Alphabetic: [#x02BB-#x02C1], #x0559, #x06E5, #x06E6.

  • Characters #x20DD-#x20E0 are excluded (in accordance with Unicode 2.0, section 5.14).

  • Character #x00B7 is classified as an extender, because the property list so identifies it.

  • Character #x0387 is added as a name character, because #x00B7 is its canonical equivalent.

  • Characters ':' and '_' are allowed as name-start characters.

  • Characters '-' and '.' are allowed as name characters.
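Productions [48] and [49] are short enough to transcribe directly, so the following is a minimal sketch of how a processor might test class membership with sorted range tables. The function and table names are illustrative assumptions, not part of this specification; the same lookup would apply to the longer BaseChar and CombiningChar tables.

```python
from bisect import bisect_right

# Ranges transcribed from production [48] Digit.
DIGIT_RANGES = [
    (0x0030, 0x0039), (0x0660, 0x0669), (0x06F0, 0x06F9), (0x0966, 0x096F),
    (0x09E6, 0x09EF), (0x0A66, 0x0A6F), (0x0AE6, 0x0AEF), (0x0B66, 0x0B6F),
    (0x0BE7, 0x0BEF), (0x0C66, 0x0C6F), (0x0CE6, 0x0CEF), (0x0D66, 0x0D6F),
    (0x0E50, 0x0E59), (0x0ED0, 0x0ED9), (0x0F20, 0x0F29),
]

# Ranges transcribed from production [49] Extender (single code points
# are written as one-element ranges).
EXTENDER_RANGES = [
    (0x00B7, 0x00B7), (0x02D0, 0x02D0), (0x02D1, 0x02D1), (0x0387, 0x0387),
    (0x0640, 0x0640), (0x0E46, 0x0E46), (0x0EC6, 0x0EC6), (0x3005, 0x3005),
    (0x3031, 0x3035), (0x309D, 0x309E), (0x30FC, 0x30FE),
]


def in_ranges(ch, ranges):
    """Return True if the single character ch lies in a sorted range table."""
    cp = ord(ch)
    starts = [lo for lo, _ in ranges]
    # Find the last range whose start is <= cp, then check its upper bound.
    i = bisect_right(starts, cp) - 1
    return i >= 0 and cp <= ranges[i][1]


def is_digit(ch):
    return in_ranges(ch, DIGIT_RANGES)


def is_extender(ch):
    return in_ranges(ch, EXTENDER_RANGES)
```

A table lookup is used here rather than a query against a current Unicode database because the classes are fixed against Unicode 2.0 and the exceptions listed above.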

C XML and SGML (Non-Normative)

XML is designed to be a subset of SGML, in that every XML document should also be a conforming SGML document. For a detailed comparison of the additional restrictions that XML places on documents beyond those of SGML, see [Clark].

D Autodetection of Character Encodings (Non-Normative)

The XML encoding declaration functions as an internal label on each XML document, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use--which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in an XML document in normal cases. Also, in many cases other sources of information are available in addition to the XML data stream itself. Two cases may be distinguished, depending on whether the XML document is presented to the processor without, or with, any accompanying (external) information. We consider the first case first.

D.1 Detection Without External Encoding Information

Because an XML document not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the Byte Order Mark required of UTF-16 data streams is "#xFEFF". The notation ## is used to denote any byte value except that two consecutive ##s cannot be both 00.

With a Byte Order Mark:

00 00 FE FF    UCS-4, big-endian machine (1234 order)
FF FE 00 00    UCS-4, little-endian machine (4321 order)
00 00 FF FE    UCS-4, unusual octet order (2143)
FE FF 00 00    UCS-4, unusual octet order (3412)
FE FF ## ##    UTF-16, big-endian
FF FE ## ##    UTF-16, little-endian
EF BB BF       UTF-8

Without a Byte Order Mark:

00 00 00 3C    UCS-4 or other encoding with a 32-bit code unit and ASCII characters encoded as ASCII values, big-endian (1234 order)
3C 00 00 00    the same, little-endian (4321 order)
00 00 3C 00    the same, unusual byte order (2143)
00 3C 00 00    the same, unusual byte order (3412); in all four of these cases the encoding declaration must be read to determine which of UCS-4 or other supported 32-bit encodings applies
00 3C 00 3F    UTF-16BE or big-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in big-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)
3C 00 3F 00    UTF-16LE or little-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in little-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)
3C 3F 78 6D    UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the relevant ASCII characters, the encoding declaration itself may be read reliably
4C 6F A7 94    EBCDIC (in some flavor; the full encoding declaration must be read to tell which code page is in use)
Other          UTF-8 without an encoding declaration, or else the data stream is mislabeled (lacking a required encoding declaration), corrupt, fragmentary, or enclosed in a wrapper of some kind
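
The two lists above translate almost directly into a lookup on the first octets of the stream. The following is a minimal sketch under stated assumptions: the function name and the family labels it returns are hypothetical (this appendix describes the families but does not name them), and the caller is assumed to supply at least the first four octets when they are available.

```python
# Four-octet Byte Order Mark patterns. These are tested before the two-octet
# UTF-16 patterns, because "FE FF ## ##" and "FF FE ## ##" exclude the case
# where both trailing octets are 00.
BOM_PATTERNS_4 = {
    b"\x00\x00\xFE\xFF": "ucs4-1234-bom",
    b"\xFF\xFE\x00\x00": "ucs4-4321-bom",
    b"\x00\x00\xFF\xFE": "ucs4-2143-bom",
    b"\xFE\xFF\x00\x00": "ucs4-3412-bom",
}

# Patterns without a Byte Order Mark: the encoded form of '<?xm' or of '<'.
NO_BOM_PATTERNS = {
    b"\x00\x00\x00\x3C": "32bit-1234",
    b"\x3C\x00\x00\x00": "32bit-4321",
    b"\x00\x00\x3C\x00": "32bit-2143",
    b"\x00\x3C\x00\x00": "32bit-3412",
    b"\x00\x3C\x00\x3F": "16bit-big-endian",
    b"\x3C\x00\x3F\x00": "16bit-little-endian",
    b"\x3C\x3F\x78\x6D": "ascii-compatible",
    b"\x4C\x6F\xA7\x94": "ebcdic",
}


def sniff_encoding_family(data: bytes) -> str:
    """Classify a byte stream by its first two to four octets."""
    head = data[:4]
    if head in BOM_PATTERNS_4:
        return BOM_PATTERNS_4[head]
    if head.startswith(b"\xFE\xFF"):
        return "utf16-big-endian-bom"
    if head.startswith(b"\xFF\xFE"):
        return "utf16-little-endian-bom"
    if head.startswith(b"\xEF\xBB\xBF"):
        return "utf8-bom"
    # "other": UTF-8 without an encoding declaration, or a mislabeled,
    # corrupt, fragmentary, or wrapped data stream.
    return NO_BOM_PATTERNS.get(head, "other")
```

The result identifies only a family of encodings; in most cases the encoding declaration must still be read to settle on a specific encoding.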

Note:

In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the XML document. Also, it is possible that new character encodings will be invented that will make it necessary to use the encoding declaration to determine the encoding, in cases where this is not required at present.

This level of autodetection is enough to read the XML encoding declaration and parse the character-encoding identifier, which is still necessary to distinguish the individual members of each family of encodings (e.g. to tell UTF-8 from 8859, and the parts of 8859 from each other, or to distinguish the specific EBCDIC code page in use, and so on).

Because the contents of the encoding declaration are restricted to characters from the ASCII repertoire (however encoded), a processor can reliably read the entire encoding declaration as soon as it has detected which family of encodings is in use. Since in practice, all widely used character encodings fall into one of the categories above, the XML encoding declaration allows reasonably reliable in-band labeling of character encodings, even when external sources of information at the operating-system or transport-protocol level are unreliable. Character encodings such as UTF-7 that make overloaded usage of ASCII-valued bytes may fail to be reliably detected.
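
As an illustration of reading the declaration once the family is known, the sketch below decodes a bounded prefix with a provisional codec for that family and extracts the encoding name. The function name, the 256-octet prefix limit, and the choice of provisional codec (for example Python's "latin-1", "utf-16-be", or "cp037") are assumptions of this sketch; the pattern accepts a name that starts with an ASCII letter followed by ASCII letters, digits, '.', '_', or '-'.

```python
import re

# Matches the start of an XML declaration and captures the quoted encoding
# name. Only ASCII-repertoire characters are involved, so any codec from the
# detected family decodes this part of the stream correctly.
_ENCODING_RE = re.compile(
    r"""^<\?xml[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1"""
)


def declared_encoding(data: bytes, provisional_codec: str):
    """Return the encoding name from the declaration, or None if absent."""
    # Decode only a bounded prefix; errors="replace" tolerates octets after
    # the declaration that the provisional codec cannot represent.
    prefix = data[:256].decode(provisional_codec, errors="replace")
    # Drop a leading Byte Order Mark if the codec passed one through.
    prefix = prefix.lstrip("\ufeff")
    match = _ENCODING_RE.match(prefix)
    return match.group(2) if match else None
```

A caller would typically compare the returned name with the detected family and then reopen the stream with the specific encoding.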

Once the processor has detected the character encoding in use, it can act appropriately, whether by invoking a separate input routine for each case, or by calling the proper conversion function on each character of input.

Like any self-labeling system, the XML encoding declaration will not work if any software changes the XML document's character set or encoding without updating the encoding declaration. Implementors of character-encoding routines should be careful to ensure the accuracy of the internal and external information used to label the XML document.

D.2 Priorities in the Presence of External Encoding Information

The second possible case occurs when the XML document is accompanied by encoding information, as in some file systems and some network protocols. When multiple sources of information are available, their relative priority and the preferred method of handling conflict should be specified as part of the higher-level protocol used to deliver XML. In particular, please refer to [IETF RFC 2376] or its successor, which defines the text/xml and application/xml MIME types and provides some useful guidance. In the interests of interoperability, however, the following rules are recommended.

  • If an XML document is in a file, the Byte-Order Mark and encoding declaration are used (if present) to determine the character encoding. All other heuristics and sources of information are solely for error recovery.

  • If an XML document is delivered with a MIME type of text/xml, then the charset parameter on the MIME type determines the character encoding method; all other heuristics and sources of information are solely for error recovery.

  • If an XML document is delivered with a MIME type of application/xml, then the Byte-Order Mark and encoding-declaration PI are used (if present) to determine the character encoding. All other heuristics and sources of information are solely for error recovery.

These rules apply only in the absence of protocol-level documentation; in particular, when the MIME types text/xml and application/xml are defined, the recommendations of the relevant RFC will supersede these rules.
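
The three recommended rules can be sketched as a small decision function. All names here are hypothetical. The sketch also assumes that a document read from a file is signaled by a MIME type of None, and that when both a Byte Order Mark and an encoding declaration are present the declared name is reported, since it is the more specific of the two; the rules themselves do not rank them.

```python
from typing import Optional


def effective_encoding(
    mime_type: Optional[str],
    charset: Optional[str],
    bom_encoding: Optional[str],
    declared: Optional[str],
) -> Optional[str]:
    """Pick the authoritative encoding source under the recommended rules.

    mime_type is None for a document read from a file (an assumption of
    this sketch). Returns None when the rules give no answer.
    """
    if mime_type == "text/xml":
        # The charset parameter decides; everything else is error recovery.
        return charset
    if mime_type is None or mime_type == "application/xml":
        # The Byte Order Mark and encoding declaration decide, if present.
        # Preferring the declared name is an assumption of this sketch.
        return declared or bom_encoding
    # Any other delivery mechanism is left to its own protocol documentation.
    return None
```

Returning None rather than a default leaves the fallback behavior to the relevant RFC or to the processor's error recovery.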

E Production Notes (Non-Normative)

This document was encoded in a slightly modified version of the XMLspec DTD (which has documentation available).